fix(monitor): fix wandb internal RSS growth at high sample log frequency #2528

Draft
mikasenghaas wants to merge 2 commits into fix/orchestrator-rss-growth from fix/wandb-rss-growth


@mikasenghaas

Summary

Following the glibc heap fix (#2527), a secondary RSS accumulation was identified in the orchestrator's wandb sample logging path. This PR tracks the investigation and the fix.

Bug: WandbMonitor.log_samples() causes ~2.5 MB/step RSS growth when log_extras.interval=1. At the default interval=10 this drops to ~0.3 MB/step, which is slow but still unbounded over long runs.

Root cause (partially understood): The accumulation is inside wandb's internals — not the Python wandb.Table object itself. Resetting self.samples_table after each wandb.log() call (which clears table.data) had no measurable effect on RSS, ruling out the Python table object as the source. The leak is somewhere in wandb's serialization/upload layer.
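The reset experiment described above can be sketched roughly as follows. This is an illustration only, not the actual WandbMonitor code: `SampleLogger`, `FakeTable`, `log_fn`, and `make_table` are hypothetical names, and swapping in a fresh table is just one way to perform such a reset. The logging function and table factory are injected so the sketch runs without wandb installed.

```python
from typing import Any, Callable, Iterable, Sequence


class FakeTable:
    """Hypothetical test double with the two wandb.Table members used here."""

    def __init__(self):
        self.data = []

    def add_data(self, *row: Any) -> None:
        self.data.append(list(row))


class SampleLogger:
    """Hypothetical stand-in for the sample-logging path (not the real monitor)."""

    def __init__(self, log_fn: Callable[[dict], None], make_table: Callable[[], Any]):
        self._log_fn = log_fn
        self._make_table = make_table
        self.samples_table = make_table()

    def log_samples(self, rows: Iterable[Sequence[Any]], step: int) -> None:
        for row in rows:
            self.samples_table.add_data(*row)
        self._log_fn({"samples": self.samples_table, "step": step})
        # The experiment: drop the accumulated table after every log call,
        # so the Python-side object cannot grow across steps.
        self.samples_table = self._make_table()
```

With wandb available, the same class could presumably be wired up as `SampleLogger(wandb.log, lambda: wandb.Table(columns=[...]))`. If RSS still grows with this reset in place, as reported above, the Python-side table is unlikely to be what retains the memory.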

Reproduction

# With sample logging every step (exaggerated to make leak visible faster)
uv run rl @ examples/alphabet_sort/rl.toml --max-steps 20 --clean-output-dir \
  --orchestrator.wandb.log-extras.interval 1

# Monitor post-trim RssAnon (replace PID)
watch -n3 "grep -m1 RssAnon /proc/<PID>/status"
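
For recording the same number from inside a script rather than with `watch`, a small helper along these lines could work. It is a sketch that assumes a Linux `/proc` filesystem; `parse_rss_anon_kb` and `read_rss_anon_mb` are hypothetical names, not part of the orchestrator.

```python
import re
from pathlib import Path
from typing import Optional, Union


def parse_rss_anon_kb(status_text: str) -> Optional[int]:
    """Extract the RssAnon value (in kB) from the text of /proc/<pid>/status."""
    match = re.search(r"^RssAnon:\s+(\d+)\s+kB", status_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_anon_mb(pid: Union[int, str] = "self") -> Optional[float]:
    """Read RssAnon for a process and convert kB to MB (Linux only)."""
    text = Path(f"/proc/{pid}/status").read_text()
    kb = parse_rss_anon_kb(text)
    return None if kb is None else kb / 1024
```

Calling `read_rss_anon_mb(pid)` once per completed step, after the trim, would give one data point per step in the same units as the table below.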

Data

Post-trim RssAnon per completed step (alphabet-sort, 512 rollouts/step):

| Step | No wandb (baseline) | wandb interval=1 |
|---|---|---|
| 2 | 941 MB | 941 MB |
| 3 | 941 MB (=) | 941 MB (=) |
| 4 | 941 MB (=) | 944 MB (+3) |
| 5 | 942 MB (=) | 946 MB (+5) |
| 6 | | 950 MB (+9) |
| 7 | | 954 MB (+13) |
| 8 | | 958 MB (+17) |

No-wandb baseline is flat. With wandb at interval=1: ~2.5 MB/step monotonic drift that malloc_trim cannot reclaim (Python/wandb heap, not glibc).

At interval=10 (default) the drift is ~0.3 MB/step — approximately 18 MB per hour at a 60s/step pace.
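
The hourly figure follows from simple arithmetic, sketched here with a hypothetical helper `drift_mb_per_hour`. The per-step rates and the 60 s/step pace are the ones quoted in this description.

```python
def drift_mb_per_hour(mb_per_step: float, seconds_per_step: float) -> float:
    """Project per-step RSS drift to an hourly rate at a fixed step pace."""
    steps_per_hour = 3600.0 / seconds_per_step
    return mb_per_step * steps_per_hour


# Default interval=10: ~0.3 MB/step at 60 s/step.
default_rate = drift_mb_per_hour(0.3, 60)
# interval=1: ~2.5 MB/step at the same pace.
every_step_rate = drift_mb_per_hour(2.5, 60)
print(f"interval=10: {default_rate:.0f} MB/h, interval=1: {every_step_rate:.0f} MB/h")
```

At the same pace, the ~2.5 MB/step observed at interval=1 works out to about 150 MB per hour.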

What was ruled out

mikasenghaas and others added 2 commits May 17, 2026 22:17
numpy/pandas allocate array data via malloc() (outside Python's allocator),
so gc.collect() alone doesn't reclaim RSS after per-step DataFrames are freed.
glibc retains freed pages in its internal pool, causing ~+6 MB/step monotonic
RssAnon growth on the orchestrator process.

malloc_trim(0) forces glibc to return freed heap pages to the OS, producing
a stable sawtooth pattern (peak during rollout generation, drops after trim)
with no upward trend.

Verified on alphabet-sort (512 rollouts/step, 20 steps, no wandb):
- Without fix: 941 MB → 973 MB by step 3 (+6 MB/step, unbounded)
- With fix:    941 MB → 942 MB by step 5 (flat, sawtooth only)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
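
For reference, the `malloc_trim(0)` call described in the commit message above can be made from Python through ctypes. This is a minimal sketch, not necessarily the code in the commit: `trim_glibc_heap` is a hypothetical name, and the function quietly does nothing on platforms whose libc does not provide `malloc_trim`.

```python
import ctypes
import ctypes.util
import gc


def trim_glibc_heap() -> bool:
    """Collect Python garbage, then ask glibc to return free heap pages to the OS.

    Returns True if malloc_trim reports that memory was released, and False
    otherwise (including on platforms without glibc's malloc_trim).
    """
    gc.collect()
    # Fall back to the usual glibc soname if the lookup finds nothing.
    libc_name = ctypes.util.find_library("c") or "libc.so.6"
    try:
        libc = ctypes.CDLL(libc_name)
        malloc_trim = libc.malloc_trim
    except (OSError, AttributeError):
        return False
    malloc_trim.argtypes = [ctypes.c_size_t]
    malloc_trim.restype = ctypes.c_int
    # malloc_trim returns 1 if memory was released, 0 otherwise.
    return malloc_trim(0) == 1
```

Calling such a helper once per step, after the per-step DataFrames have been dropped, would match the sawtooth behaviour described above.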
@mikasenghaas mikasenghaas force-pushed the fix/orchestrator-rss-growth branch 2 times, most recently from ff4eb04 to faf31fa Compare May 17, 2026 22:31